
Replace dual-sync-async persistence panic with Watch contract #4436

Open

joostjager wants to merge 7 commits into lightningdevkit:main from joostjager:revert-mixed-mode-check

Conversation

@joostjager
Contributor

@joostjager joostjager commented Feb 24, 2026

Fix false panic by relaxing the Watch contract from a global "pick one mode" restriction to a per-channel rule: don't return Completed while prior InProgress updates are still pending for that channel.
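As a rough illustration of the per-channel rule described above (a sketch, not LDK's actual code; `PerChannelTracker` and the `u64` ids are hypothetical), the invariant can be modeled as: InProgress update ids accumulate per channel, and Completed is only a valid return while that channel's in-flight set is empty:

```rust
use std::collections::{HashMap, HashSet};

#[derive(Debug, PartialEq)]
enum UpdateStatus {
    Completed,
    InProgress,
}

// Hypothetical tracker modeling the per-channel contract; not an LDK type.
#[derive(Default)]
struct PerChannelTracker {
    // Per-channel set of update ids returned as InProgress and not yet
    // signalled complete.
    in_flight: HashMap<u64, HashSet<u64>>,
}

impl PerChannelTracker {
    // Record the status a Watch implementation returned for (channel, update).
    // Panics if Completed is returned while prior updates are still in flight,
    // mirroring the per-channel assertion this PR introduces.
    fn record(&mut self, channel: u64, update_id: u64, status: UpdateStatus) {
        let pending = self.in_flight.entry(channel).or_default();
        match status {
            UpdateStatus::InProgress => {
                pending.insert(update_id);
            },
            UpdateStatus::Completed => {
                assert!(
                    pending.is_empty(),
                    "Completed returned while prior InProgress updates are pending"
                );
            },
        }
    }

    // Called once an async update finishes (the MonitorEvent::Completed path).
    fn update_completed(&mut self, channel: u64, update_id: u64) {
        if let Some(pending) = self.in_flight.get_mut(&channel) {
            pending.remove(&update_id);
        }
    }
}
```

Note that the rule is scoped per channel: returning InProgress for one channel does not constrain what may be returned for updates to a different channel.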

@ldk-reviews-bot

ldk-reviews-bot commented Feb 24, 2026

👋 I see @TheBlueMatt was un-assigned.
If you'd like another reviewer assignment, please click here.

@joostjager
Contributor Author

joostjager commented Feb 24, 2026

I didn't add a dedicated test for the InProgress -> complete -> Completed progression since it's already well-covered by existing tests that exercise this flow without disabling the contract check: test_monitor_update_fail_cs, test_monitor_update_fail_no_rebroadcast, test_monitor_update_fail_reestablish, monitor_failed_no_reestablish_response, test_monitor_update_fail_claim, test_monitor_update_on_pending_forwards, monitor_update_claim_fail_no_response, do_during_funding_monitor_fail, test_path_paused_mpp, test_temporary_error_during_shutdown, double_temp_error, test_manager_persisted_post_outbound_edge_holding_cell.

The ChainMonitor override path (where update_monitor fails and the return is overridden to InProgress) is also exercised without disabling the check by: test_monitor_and_persister_update_fail, test_concurrent_monitor_claim, test_update_err_monitor_lockdown.

@joostjager joostjager force-pushed the revert-mixed-mode-check branch 2 times, most recently from c20c819 to 77a6a30 Compare February 24, 2026 11:07
@joostjager joostjager marked this pull request as ready for review February 24, 2026 12:07
@codecov

codecov bot commented Feb 24, 2026

Codecov Report

❌ Patch coverage is 98.11321% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 86.00%. Comparing base (647c3f7) to head (b15785c).
⚠️ Report is 6 commits behind head on main.

Files with missing lines Patch % Lines
lightning/src/ln/channelmanager.rs 90.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4436      +/-   ##
==========================================
+ Coverage   85.98%   86.00%   +0.02%     
==========================================
  Files         159      159              
  Lines      105415   105454      +39     
  Branches   105415   105454      +39     
==========================================
+ Hits        90643    90700      +57     
+ Misses      12255    12234      -21     
- Partials     2517     2520       +3     
Flag Coverage Δ
tests 86.00% <98.11%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.

Collaborator

@TheBlueMatt TheBlueMatt left a comment

There's more context in the PR description than the commit message, please update the commit message with the context (PR descriptions should almost always just be the commit messages themselves!). Also, IIRC the chanmon_consistency target is conservative to avoid violating the old API, we should update it to allow returning Completed more often to match the API changes here.

@ldk-reviews-bot

👋 The first review has been submitted!

Do you think this PR is ready for a second reviewer? If so, click here to assign a second reviewer.

@joostjager joostjager force-pushed the revert-mixed-mode-check branch from 77a6a30 to 218e89a Compare February 24, 2026 15:11
@joostjager
Contributor Author

Updated fuzzer

// bit-twiddling mutations to have similar effects. This is probably overkill, but no
// harm in doing so.
0x00 => {
*mon_style[0].borrow_mut() = ChannelMonitorUpdateStatus::InProgress;
Collaborator

We can probably drop mon_style again, no? Given we added it in the being-basically-reverted commit?

Contributor Author

It's currently only used to remember what style to use after a node reload. But I could read from update_ret too and set that again on the new persister.

We could also reduce the number of command bytes by defining a command to toggle the mode.

@joostjager joostjager force-pushed the revert-mixed-mode-check branch from 218e89a to 7e2b509 Compare February 24, 2026 19:24
@joostjager
Contributor Author

joostjager commented Feb 24, 2026

Pushed two commits that remove mon_style and also save command bytes by toggling the style.

The only functional change is that nodes now always start in sync mode, but the fuzzer can immediately toggle it to async. On reload, the style is carried over.

Collaborator

@TheBlueMatt TheBlueMatt left a comment

Did you run the fuzzer for a while? I'm not sure that we shouldn't be able to hit a failure like the one below.

/// [`Persist`]/[`Watch`] without first restarting.
/// A [`Watch`] implementation must not return this for a channel update if there are still
/// pending [`InProgress`] updates for that channel. That is, an update can only be considered
/// complete once all prior updates have also completed.
Collaborator

Hmm, thinking about this a bit more there's a weird race. Async updates are "completed" only after release_pending_monitor_events returns the completed update, but also only after it has been processed, which may be some arbitrary and indeterminate amount of time later.

For the ChainMonitor return case, this is fine (though we need to update ChainMonitor - we should probably be returning InProgress for any new update that comes after one for which we have to return a spurious InProgress. Do you already intend to do that in a followup? Should we not do it here?). But for the general API it's pretty weird - you can in theory return a Completed after returning InProgress as long as you wait some indeterminate and arbitrary amount of time, so in practice you can't...

We could fix this by adding some mutual exclusion in channelmanager.rs where we wait before processing Completed monitor updates until after any pending MonitorEvents are processed - this should be basically zero cost as ~no one is going to be using mixed-async-sync persistence in practice so we'll rarely if ever have any contention.

On the flip side, we could say that implementations are allowed to flip from Completed to InProgress for a channel, but never back (without a restart). It's more complicated to document, but it captures what we need to allow the ChainMonitor behavior.
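The first option above (mutual exclusion in channelmanager.rs) might be sketched roughly as follows. This is a toy model with hypothetical names, not LDK's API: the point is only that completion handling takes the same lock the event drain holds, so a Completed signal cannot be acted on while pending MonitorEvents are mid-processing:

```rust
use std::sync::Mutex;

// Toy model: pending monitor event ids awaiting processing. The Mutex is the
// mutual-exclusion point between event draining and completion handling.
struct EventGate {
    pending_monitor_events: Mutex<Vec<u64>>,
}

impl EventGate {
    // Handling a Completed update takes the same lock the drain holds, so it
    // blocks until any in-flight drain finishes. Returns whether the
    // completion may be acknowledged yet.
    fn channel_monitor_updated(&self, update_id: u64) -> bool {
        let pending = self.pending_monitor_events.lock().unwrap();
        // Only acknowledge once no pending event still references this update.
        !pending.contains(&update_id)
    }

    // Drains pending MonitorEvents while holding the lock.
    fn process_pending_events(&self) {
        let mut pending = self.pending_monitor_events.lock().unwrap();
        pending.clear();
    }
}
```

As the comment notes, contention here should be negligible in practice since mixed sync-async persistence is expected to be rare.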

Contributor Author

@joostjager joostjager Feb 24, 2026

implementations are allowed to flip from Completed to InProgress for a channel, but never back

This sounds like the easier way. There is no need to support mixing sync and async for Persist implementations. I do wonder now whether my initial version #4435 wasn't just sufficient. Moving the mixed mode check to the chainmonitor level and keeping everything else the same.

Fuzzer run was planned for tonight.

Contributor Author

@joostjager joostjager Feb 25, 2026

The per-channel mode latch turned out to be not ideal because it required either a new lock on ChannelManager or threading the latch through deeply nested call chains via an already-held lock, all to accommodate ChainMonitor internally using InProgress creatively to signal a monitor update failure rather than async persistence.

Perhaps it is better to make this case explicit in the API? #4445

You were definitely right about the fuzzer. Woke up to lots of errors.

Contributor Author

Returning Err from Watch::update_channel doesn't work for a future remote ChainMonitor setup where the ChannelMonitor lives in a separate process. In that case, update_channel returns InProgress and you only find out asynchronously whether the monitor update succeeded. The API was previously changed away from returning Result for this reason (as explained by @TheBlueMatt)

Went with the option to just document that switching back from async to sync is impractical.

Also reverted the fuzzer changes since the fuzzer freely switching between sync and async modes per node is impractical for the same reason.

Changes pushed.

@joostjager joostjager force-pushed the revert-mixed-mode-check branch 2 times, most recently from c37be93 to 4cf848c Compare February 26, 2026 09:10
@joostjager joostjager added the weekly goal Someone wants to land this this week label Feb 26, 2026
Collaborator

@TheBlueMatt TheBlueMatt left a comment

Don't we still need to update ChainMonitor? IIUC it can currently switch from Completed to InProgress if a monitor update application Errs, but then it'll immediately switch back for a later update to the same channel, which is still not allowed.

/// [`InProgress`] back to [`Completed`] for a given channel is only valid after all prior
/// updates have finished.
///
/// Note that an update is only considered complete from [`ChannelManager`]'s perspective once
Collaborator

IMO it's not worth getting into this level of detail. Better to just say that you cannot switch from InProgress to Completed without a restart, but you can go the other way.

Contributor Author

It's updated. I'd have preferred to keep the more extensive docs though. The interaction between persistence completion, MonitorEvent::Completed processing, and switching update status isn't completely straightforward, and I think spelling it out is helpful.

@joostjager joostjager force-pushed the revert-mixed-mode-check branch from 4cf848c to 0d1e978 Compare February 26, 2026 13:06
@joostjager joostjager self-assigned this Feb 26, 2026
@joostjager
Contributor Author

Discussed offline that attention is needed for the case where ChainMonitor overrides a result to InProgress, after which a further update is supposed to be applied to the monitor and expected to succeed (a preimage, for example).

@joostjager
Contributor Author

joostjager commented Feb 26, 2026

Furthermore, there might also be a problem with switching modes after a restart. If ChannelManager was saved with in_flight_monitor_updates, a Completed result after the restart may fail the mode check...

@TheBlueMatt
Collaborator

Furthermore, there might also be a problem with switching modes after a restart. If ChannelManager was saved with in_flight_monitor_updates, a Completed result after the restart may fail the mode check...

Ugh, yea, we'll have to explicitly allow this. It's fine, because those updates will get applied on startup before we generate any new ones, just awkward.

Collaborator

@TheBlueMatt TheBlueMatt left a comment

As expressed before, I am quite worried about the level of complexity we have reached with async monitor updates. This PR is one example, but #4434 is another, and the deferred writing in #4351 isn't going to make it easier. It is a cost in terms of development time, but potentially also in the stability of the software.

Yea, there's no question there's a lot of complexity here. On the flip side, it's also become a critical feature for LDK's performance. I think we should first think critically about whether there's anything we can do to make the logic more straightforward (for example, maybe move Channel inside ChannelMonitor entirely, which would remove the "multiple places to look for HTLC state (resolution)" issue we have here).

}

if update_res.is_err() {
if monitor.no_further_updates_allowed()
Collaborator

Why did we change this from update_res.is_err?

Contributor Author

Because we still can't return Completed for a successful post-close update if we haven't processed the events. We'd hit the mode check.

Collaborator

Right, but the existing code already returned InProgress? I don't see why this swap made a difference to the return value (this is different from persist_res)

Contributor Author

The existing code didn't always return InProgress. We could have a closed channel monitor (that the chan mgr wasn't aware of yet), being fed first a new commitment (returning error), followed by a preimage update (returning ok) that would, in the existing code, propagate Completed up to the chan mgr, triggering the mode check?

Collaborator

Ah, sorry, yea, that makes sense. My concern, which maybe wasn't super clear, is that we might have some reason why we might want to return an Err from the update even though no_further_updates_allowed is false. no_further_updates_allowed is kinda "peeking past the curtain" here, instead I'd very much prefer to check update_res.is_err() and also if there are items in deferred_monitor_update_completion.

Contributor Author

I am not sure how we should deal with the residual is_err cases

Collaborator

Ah, mmm, okay, then maybe we should check both? We can also add a debug_assert that if it errors we also have no_further_updates_allowed?

Contributor Author

Still not sure what to do for the is_err case if we check both. Added the debug_assert.

Collaborator

Yea, not sure either honestly. ISTM we should at least try? Do what I described but accept the race as possible if we hit the case (which we shouldn't, and we debug_assert it won't)?

Contributor Author

Added condition

@joostjager joostjager force-pushed the revert-mixed-mode-check branch from 81b0841 to e841659 Compare March 9, 2026 15:58
joostjager and others added 2 commits March 11, 2026 12:22
Commit 0760f99 ("Disallow dual-sync-async persistence without
restarting") added a panic in non-test builds when a Persist
implementation returns both Completed and InProgress from the same
ChannelManager instance. However, this check runs against the status
that ChainMonitor returns to ChannelManager, not the raw Persist
result. When ChannelMonitor::update_monitor fails (e.g. a
counterparty commitment_signed arrives after a funding spend
confirms), ChainMonitor persists the full monitor successfully but
overrides the return value to InProgress. If the user's Persist impl
only ever returns Completed, this override triggers a false
mode-mismatch panic.

This replaces the panic with a per-channel contract at the Watch
trait level: a Watch implementation must not return Completed for a
channel update while prior InProgress updates are still pending.
Switching from Completed to InProgress is always allowed, but
switching back is impractical because the Watch implementation cannot
observe when ChannelManager has finished processing a
MonitorEvent::Completed. The documentation on
ChannelMonitorUpdateStatus is updated to describe these rules.

The mode tracking and panic checks from 0760f99 are removed and
replaced with a panic that validates the new contract directly on
the in-flight update state. Legacy tests that switch the persister
between modes mid-flight can opt out via
Node::disable_monitor_completeness_assertion().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a regression test that reproduces the panic when a commitment_signed
is processed after the counterparty commitment transaction has confirmed.
The ChannelMonitor's no_further_updates_allowed() returns true, causing
update_monitor to fail, which ChainMonitor overrides to InProgress. A
subsequent preimage claim returning Completed then triggers the
per-channel assertion that Completed must not follow InProgress.

AI tools were used in preparing this commit.
@joostjager joostjager force-pushed the revert-mixed-mode-check branch 2 times, most recently from 0d11279 to 20d3090 Compare March 12, 2026 15:04
When no_further_updates_allowed() is true and the persister returns
Completed, ChainMonitor now overrides the return to InProgress and
queues the update_id in deferred_monitor_update_completions. These
are resolved in release_pending_monitor_events, so ChannelManager
sees them complete together with the force-close MonitorEvents.

This eliminates phantom InProgress entries that would never complete:
previously, a rejected pre-close update (e.g. commitment_signed
arriving after funding spend) returned InProgress with no completion
path, blocking MonitorUpdateCompletionActions (PaymentClaimed,
PaymentForwarded) indefinitely. A subsequent post-close update
returning Completed would then violate the in-order completion
invariant.

AI tools were used in preparing this commit.
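The deferral described in this commit message can be sketched as follows. This is an illustrative model only (the struct, fields, and `Status` enum here are hypothetical, not ChainMonitor's actual internals): a Completed persister result on a monitor that disallows further updates is overridden to InProgress and its update id queued, then drained when pending monitor events are released:

```rust
use std::collections::VecDeque;

#[derive(Debug, PartialEq, Clone, Copy)]
enum Status {
    Completed,
    InProgress,
}

// Hypothetical per-monitor state modeling the deferral mechanism.
struct MonitorState {
    no_further_updates_allowed: bool,
    deferred_completions: VecDeque<u64>,
}

impl MonitorState {
    // Models the update_channel override: instead of reporting a rejected
    // pre-close update's persistence as complete, defer it.
    fn apply_update(&mut self, update_id: u64, persist_res: Status) -> Status {
        if self.no_further_updates_allowed && persist_res == Status::Completed {
            self.deferred_completions.push_back(update_id);
            return Status::InProgress;
        }
        persist_res
    }

    // Models release_pending_monitor_events: resolve deferred ids so the
    // ChannelManager sees them complete together with force-close events.
    fn release_pending(&mut self) -> Vec<u64> {
        self.deferred_completions.drain(..).collect()
    }
}
```

This gives every spurious InProgress a completion path, which is exactly what avoids the phantom entries blocking MonitorUpdateCompletionActions described above.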
@joostjager joostjager force-pushed the revert-mixed-mode-check branch from 20d3090 to b15785c Compare March 12, 2026 19:19
@ldk-claude-review-bot
Collaborator

Automated Review

Two issues flagged as inline comments:

  1. Behavior change when update_monitor returns Err — the removal of the update_res.is_err() override silently changes behavior for step-level monitor update failures (non-post-close errors). See inline comment on chainmonitor.rs.

  2. Doc/assertion mismatch on Completed — the updated doc on ChannelMonitorUpdateStatus::Completed describes a stricter contract than what the code enforces. See inline comment on mod.rs.

The core deferral logic (deferred completions in release_pending_monitor_events, interaction with channel_monitor_updated, lock ordering) appears correct.

🤖 Generated with Claude Code

}

if update_res.is_err() {
if monitor.no_further_updates_allowed()
Collaborator

Note that if we do so we'll have to hold the deferred_monitor_update_completion mutex through the entire monitor updating process, but I think that's fine.

// force-close processing) also gets deferred and resolved in the next event cycle.
for monitor_state in monitors.values() {
let deferred =
monitor_state.deferred_monitor_update_completions.lock().unwrap().split_off(0);
Collaborator

We need to hold this while we call get_and_clear_pending_monitor_events so that we can't race. That would imply merging the loops but I believe that's fine.

Contributor Author

Added, also in update_channel.

@joostjager joostjager force-pushed the revert-mixed-mode-check branch from 72020f8 to e50facd Compare March 16, 2026 09:17
@ldk-claude-review-bot
Collaborator

Review

The deferral mechanism and locking strategy are well thought out overall. Two items worth attention:

  1. Lock granularity in deferred drain loop (inline comment): pending_monitor_updates is locked/unlocked per iteration in release_pending_monitor_events. Holding it for the entire loop is both more efficient and eliminates a theoretical race with channel_monitor_updated.

  2. Removed update_res.is_err() fallback (inline comment): The old InProgress return on any update_monitor error is replaced by the no_further_updates_allowed() deferral. This covers the main case but changes behavior for rare internal monitor update failures. Practical impact is low.

🤖 Generated with Claude Code

@ldk-claude-review-bot
Collaborator

Automated Review

Two issues found, both in chainmonitor.rs update_channel:

  1. Dropped update_res.is_err() override (bug) — The old code returned InProgress when update_monitor() failed, freezing the channel. The replacement no_further_updates_allowed() check doesn't cover this case for non-post-close channels, potentially reporting Completed to ChannelManager when the monitor update actually failed.

  2. Race between channel_monitor_updated and deferred drain loop — The pending_monitor_updates lock is acquired/released per drain iteration, while channel_monitor_updated can run concurrently from another thread. This could produce duplicate Completed events or premature completion signals.

See inline comments for details.

@joostjager joostjager requested a review from TheBlueMatt March 16, 2026 09:55
Comment on lines +1489 to +1505
for update_id in deferred.drain(..) {
	pending.retain(|id| *id != update_id);
	let monitor_is_pending_updates = monitor_state.has_pending_updates(&pending);
	if !monitor_is_pending_updates {
		let funding_txo = monitor_state.monitor.get_funding_txo();
		let channel_id = monitor_state.monitor.channel_id();
		pending_monitor_events.push((
			funding_txo,
			channel_id,
			vec![MonitorEvent::Completed {
				funding_txo,
				channel_id,
				monitor_update_id: monitor_state.monitor.get_latest_update_id(),
			}],
			monitor_state.monitor.get_counterparty_node_id(),
		));
	}
Collaborator

Potential duplicate MonitorEvent::Completed: if channel_monitor_updated (public API, or via get_and_clear_completed_updates) processes a deferred update_id before this drain runs, that update will already have been removed from pending and a Completed event will already have been emitted. Then this drain loop will call retain (no-op since the id is gone), see pending is empty, and emit a second Completed for the same monitor.

In practice this requires a buggy Persist implementation reporting the same update both synchronously (returning Completed) and asynchronously (via get_and_clear_completed_updates), or external misuse of force_channel_monitor_updated. The impact is low since ChannelManager::channel_monitor_updated handles duplicate Completed events idempotently. But a defensive check would be cheap:

Suggested change
for update_id in deferred.drain(..) {
	pending.retain(|id| *id != update_id);
	let monitor_is_pending_updates = monitor_state.has_pending_updates(&pending);
	if !monitor_is_pending_updates {
		let funding_txo = monitor_state.monitor.get_funding_txo();
		let channel_id = monitor_state.monitor.channel_id();
		pending_monitor_events.push((
			funding_txo,
			channel_id,
			vec![MonitorEvent::Completed {
				funding_txo,
				channel_id,
				monitor_update_id: monitor_state.monitor.get_latest_update_id(),
			}],
			monitor_state.monitor.get_counterparty_node_id(),
		));
	}
for update_id in deferred.drain(..) {
	let prev_len = pending.len();
	pending.retain(|id| *id != update_id);
	if prev_len == pending.len() {
		// Already resolved by channel_monitor_updated; skip to avoid
		// duplicate MonitorEvent::Completed.
		continue;
	}
	let monitor_is_pending_updates = monitor_state.has_pending_updates(&pending);
	if !monitor_is_pending_updates {
		let funding_txo = monitor_state.monitor.get_funding_txo();
		let channel_id = monitor_state.monitor.channel_id();
		pending_monitor_events.push((
			funding_txo,
			channel_id,
			vec![MonitorEvent::Completed {
				funding_txo,
				channel_id,
				monitor_update_id: monitor_state.monitor.get_latest_update_id(),
			}],
			monitor_state.monitor.get_counterparty_node_id(),
		));
	}
}

@ldk-claude-review-bot
Collaborator

ldk-claude-review-bot commented Mar 16, 2026

Automated Review

No new issues found beyond prior review passes. The core deferral mechanism (deferred completions in release_pending_monitor_events, lock ordering between deferred_monitor_update_completions and pending_monitor_updates, interaction with channel_monitor_updated) is correct. Event ordering ensures force-close MonitorEvents are processed before deferred Completed events. The per-channel assertion in ChannelManager is a strict improvement over the old global mode check.

Prior inline comments on (1) the removed update_res.is_err() fallback for non-post-close errors, (2) lock granularity in the deferred drain loop, and (3) the Completed doc wording still apply.

🤖 Generated with Claude Code

@TheBlueMatt TheBlueMatt removed their request for review March 16, 2026 17:43